Predicting Dry-bean Variety Types Using Classification Algorithms
David Dilsuk and Owen O'Connor
The Problem
Dry beans are an important food crop grown around the world. In 2016, more than 27 million tonnes of dry beans were harvested globally. Our project tackles a classification problem relating to dry bean varieties grown in Turkey.
According to the authors of a 2020 study, only a small percentage of the beans cultivated in Turkey come from certified seed containing a single variety. Most bean cultivation has some level of varietal mixing, and the beans need to be sorted after harvest so that farmers can get the best prices for their products. Much like a load of unsorted metal brought to a scrap yard, a mixed harvest of bean varieties is devalued and discounted. Because the sorting process is tedious and labor intensive, effort is being made to explore ways that computers can assist in the bean sorting process.
For our project, we use a dataset that contains computer-collected measurements of 7 different varieties of Turkish dry beans. Our goal is to classify the variety of individual beans using their measured physical characteristics as the predictive variables.
The Data
Our dataset contains information on 13,611 individual dry beans. The target variable is bean variety, of which there are 7: Barbunya, Bombay, Cali, Dermason, Horoz, Seker, and Sira. The dataset contains 16 feature columns, which are physical measurements of the beans obtained through computer vision. Specifically, the features available are:

| Variable Name | Type | Description |
| ------------- |:-------------:| ----- |
| Area | Integer | The area of a bean zone: the number of pixels within its boundaries |
| Perimeter | Continuous | Bean circumference, defined as the length of its border |
| MajorAxisLength | Continuous | The distance between the ends of the longest line that can be drawn from a bean |
| MinorAxisLength | Continuous | The longest line that can be drawn from the bean while standing perpendicular to the main axis |
| AspectRatio | Continuous | Defines the relationship between MajorAxisLength and MinorAxisLength |
| Eccentricity | Continuous | Eccentricity of the ellipse having the same moments as the region |
| ConvexArea | Integer | Number of pixels in the smallest convex polygon that can contain the area of a bean seed |
| EquivDiameter | Continuous | Equivalent diameter: the diameter of a circle having the same area as the bean seed |
| Extent | Continuous | The ratio of the pixels in the bounding box to the bean area |
| Solidity | Continuous | Also known as convexity: the ratio of the pixels in the convex shell to those found in the bean |
| Roundness | Continuous | Calculated with the formula (4πA)/(P²) |
| Compactness | Continuous | Measures the roundness of an object |
| ShapeFactor1 | Continuous | |
| ShapeFactor2 | Continuous | |
| ShapeFactor3 | Continuous | |
| ShapeFactor4 | Continuous | |
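To make the derived geometric features concrete, the snippet below computes Roundness and EquivDiameter from hypothetical single-bean Area and Perimeter values (the numbers are illustrative, not taken from the dataset):

```python
import numpy as np

# Hypothetical measurements for one bean, in pixels (not from the dataset)
area = 45000.0
perimeter = 800.0

# Roundness as defined in the table: (4 * pi * A) / P^2
roundness = (4 * np.pi * area) / perimeter ** 2

# Equivalent diameter: diameter of a circle with the same area as the bean
equiv_diameter = np.sqrt(4 * area / np.pi)

print(roundness, equiv_diameter)
```

A perfectly circular bean would have Roundness of exactly 1; values below 1 indicate more elongated or irregular shapes.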
Our Approach
Our machine learning task is a multi-class classification problem, and we have a clean, numeric dataset. We have decided to use two of the most effective models for this kind of problem: Support Vector Machines (SVM) and Random Forests. For both model types, we will first measure the predictive accuracy of a baseline classifier with default values. We will then explore different hyperparameter and feature options to see whether predictive power can be improved. Finally, we will select the best models and evaluate them against a test set.
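The baseline comparison described above can be sketched as follows. This is a minimal illustration on synthetic data: `make_classification` stands in for the bean measurements, and the default hyperparameters are placeholders rather than final choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the bean data: 3 classes, a few informative features
X_demo, y_demo = make_classification(n_samples=500, n_classes=3,
                                     n_informative=6, random_state=0)

# Baseline models with default hyperparameters; SVM gets standardized inputs
baselines = {
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'Random Forest': RandomForestClassifier(random_state=0),
}

# Compare mean 5-fold cross-validation accuracy
results = {}
for name, model in baselines.items():
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f'{name}: {results[name]:.3f}')
```

The same loop structure carries over to the real data once `X` and `y` are loaded; only the hyperparameter search is added on top.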
Before proceeding with that, we will take a moment to explore our data.
Exploratory Data Analysis
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
np.set_printoptions(precision=4)
# Run the utils notebook so that its helper functions are available
%run utils.ipynb
# Load our dataset
X, y, features = load_beans()
The first thing we will do is take a look at the histograms of each of our feature variables.
# Make a histogram of each variable
fig, ax = plt.subplots(nrows=6, ncols=3, figsize=(12, 18))
feat = 0
for i in range(6):
    for j in range(3):
        if feat < len(features):
            ax[i, j].hist(X[:, feat], bins=30)
            ax[i, j].set_title(features[feat])
            feat += 1
plt.tight_layout()
plt.show()
While we can see that the feature variables have different distributions, the most striking observation is that the variables span very different scales, so it will likely be advantageous for us to standardize the data. This is reinforced by the ranges in the summary statistics of the dataset and by the boxplots of the individual features:
# Print out summary statistics
beans = pd.read_csv('https://raw.githubusercontent.com/oroconnor/CS345-Project/main/dry%2Bbean%2Bdataset/DryBeanDataset/Dry_Bean_Dataset.csv', delimiter=',')
beans.describe()
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13611.000000 |
| mean | 53048.284549 | 855.283459 | 320.141867 | 202.270714 | 1.583242 | 0.750895 | 53768.200206 | 253.064220 | 0.749733 | 0.987143 | 0.873282 | 0.799864 | 0.006564 | 0.001716 | 0.643590 | 0.995063 |
| std | 29324.095717 | 214.289696 | 85.694186 | 44.970091 | 0.246678 | 0.092002 | 29774.915817 | 59.177120 | 0.049086 | 0.004660 | 0.059520 | 0.061713 | 0.001128 | 0.000596 | 0.098996 | 0.004366 |
| min | 20420.000000 | 524.736000 | 183.601165 | 122.512653 | 1.024868 | 0.218951 | 20684.000000 | 161.243764 | 0.555315 | 0.919246 | 0.489618 | 0.640577 | 0.002778 | 0.000564 | 0.410339 | 0.947687 |
| 25% | 36328.000000 | 703.523500 | 253.303633 | 175.848170 | 1.432307 | 0.715928 | 36714.500000 | 215.068003 | 0.718634 | 0.985670 | 0.832096 | 0.762469 | 0.005900 | 0.001154 | 0.581359 | 0.993703 |
| 50% | 44652.000000 | 794.941000 | 296.883367 | 192.431733 | 1.551124 | 0.764441 | 45178.000000 | 238.438026 | 0.759859 | 0.988283 | 0.883157 | 0.801277 | 0.006645 | 0.001694 | 0.642044 | 0.996386 |
| 75% | 61332.000000 | 977.213000 | 376.495012 | 217.031741 | 1.707109 | 0.810466 | 62294.000000 | 279.446467 | 0.786851 | 0.990013 | 0.916869 | 0.834270 | 0.007271 | 0.002170 | 0.696006 | 0.997883 |
| max | 254616.000000 | 1985.370000 | 738.860154 | 460.198497 | 2.430306 | 0.911423 | 263261.000000 | 569.374358 | 0.866195 | 0.994677 | 0.990685 | 0.987303 | 0.010451 | 0.003665 | 0.974767 | 0.999733 |
# Make a boxplot of each variable - ranges are too different to show all together
fig, ax = plt.subplots(nrows=6, ncols=3, figsize=(12, 18))
feat = 0
for i in range(6):
    for j in range(3):
        if feat < len(features):
            ax[i, j].boxplot(X[:, feat])
            ax[i, j].set_title(features[feat])
            feat += 1
plt.tight_layout()
plt.show()
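Given those scale differences, the standardization step we anticipate can be sketched as follows. `X_demo` is synthetic stand-in data with dataset-like scales; in the actual pipeline we would fit the scaler on the training split of `X`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (roughly Area and Perimeter)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[53000, 855], scale=[29000, 214], size=(200, 2))

# StandardScaler gives each feature zero mean and unit variance
X_std = StandardScaler().fit_transform(X_demo)
print(X_std.mean(axis=0).round(6))  # approximately 0 per column
print(X_std.std(axis=0).round(6))   # approximately 1 per column
```

After this transform, distance-based methods such as SVM no longer let the large-scale features (Area, ConvexArea) dominate the small-scale ones (ShapeFactor2, Solidity).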
Next we will take a look at some scatter plots of the data, with points colored by the target variable. We are trying to get a sense of how well separated the classes are. With 16 features there are a lot of pairs to look at, but it still seems worth exploring:
# Scatterplots, colored by target classifications
from sklearn.preprocessing import LabelEncoder
ynums = LabelEncoder().fit_transform(y)
nrows = 64
ncols = 4
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 3 * nrows))
i = 0
j = 0
for feat1 in range(len(features)):
    for feat2 in range(len(features)):
        ax[i, j].scatter(X[:, feat1], X[:, feat2], c=ynums, alpha=0.5, s=30)
        ax[i, j].set_xlabel(features[feat1])
        ax[i, j].set_ylabel(features[feat2])
        j += 1
        if j == ncols:
            j = 0
            i += 1
plt.tight_layout()
plt.show()
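With 256 pairwise panels, a more compact (though lossier) way to eyeball class separation is to project the standardized features onto two principal components. The sketch below uses synthetic data as a stand-in for `X` and `ynums`:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 16 features and 3 classes, loosely mirroring the beans
X_demo, y_demo = make_classification(n_samples=300, n_features=16,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

# Standardize, then keep the two directions of greatest variance
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_demo))
# X_2d can then be passed to a single plt.scatter(..., c=ynums) call
```

This does not replace the pairwise view, since PCA can hide separation that exists in individual feature pairs, but it gives a quick one-plot summary.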